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Abstract 

A number of known techniques for improving cache performance in scientific compu- 
tations involve the reordering of the iteration space. Some of these reorderings can 
be considered as coverings of the iteration space with the sets having good surtace- 
to- volume ratio. Use of such sets reduces the number of cache misses in compu- 
tations of local operators having the iteration space as a domain. First, we derive 
lower bounds which any algorithm must suffer while computing a local operator on 
a grid. Then we explore coverings of iteration spaces represented by structured and 
unstructured grids which allow us to approach these lower bounds. For structured 
grids we introduce a covering by successive minima tiles of the interference lattice 
of the grid. We show that the covering has low surface-to- volume ratio and present 
a computer experiment showing actual reduction of the cache misses achieved y 
using these tiles. For planar unstructured grids we show existence of a covering 
which reduces the number of cache misses to the level of structured grids. On the 
other hand, we present a triangulation of a 3-dimensional cube such that any loca 
operator on the corresponding grid has significantly larger number of cache misses 
than a similar operator on a structured grid. 
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1 Introduction 

A number of known techniques for improving cache performance in scientific compu- 
tations involve the reordering of the iteration space. We present two new methods 
for partitioning the iteration space with minimum-surface cache fitting sets. Such 
partitionings reduce the number of cache misses to a level that is close to the the- 
oretical minimum. We show that the coverings reduce the number of the misses 
by actual measurements of cache misses in computations of a second order stencil 
operator on structured three-dimensional grids. 

A good tiling of the iteration space for structured discretization grids can be 
constructed by using the interference lattice of the grid. This lattice is a set of grid 
indices mapped into the same word in the cache or, equivalently, a set of solutions 
of the Cache Miss Equation [7]. In [2] we introduced a (generally skewed) tding of 
the iteration space of the explicit operators on structured grids with parallelepipeds 
built on a reduced basis of the interference lattice. We showed that for lattices 
whose second shortest vector is relatively long the tiling reduces the number of 
cache misses to a value close to the theoretical lower bound. Constructing the 
skewed tiling, however, is a nontrivial task, and involves a significant overhead in 
testing whether a particular point lies inside the tile. Tiling a three-dimensional 
grid, for example, requires the determination of 29 integer parameters to construct 
the loop nest of depth six, and involves a significant branching overhead. 

We start the paper with deriving lower bounds which any algorithm must 
suffer while computing an a local operator on a grid. 

Then we introduce two new, coverings of structured grids: a covering with 
Voronoi cells and a covering with rectilinear parallelepipeds built on the vectors 
of successive minima of the interference lattice. In lattices with a relatively long 
shortest vector the cells of both coverings have near-minimal surface-to- volume ra- 
tios. Hence, the number of cache misses in the computations tiled with these cells 
is close to the theoretical minimum derived in [2]. Direct measurements of the 
cache misses show a significant advantage of the successive minima covering rela- 
tive to the computations using the canonical loop ordering (maximally optimized 
by a compiler). 

For the computations of local explicit operators on planar unstructured grids 
we construct a near-minimum-perimeter covering by applying the Lipton-Tarjan 
planar graph separator algorithm [8]. The perimeter-to-area ratio of the sets of this 
covering is 0(l/\/S), where S is the cache size. Lastly, we construct an unstructured 
grid that triangulates a 3-dimensional cube and show that the grid can not be 
covered with sets having small surface-to-volume ratios. The last result shows 
that any computation of an explicit operator on such 3-dimensional grid would 
suffer larger number of cache misses than a computation of a similar operator on a 
structured grid of the same size. 


2 Cache Usage in Computations of Local Operators 

Local operators on the grids. We consider the problem of computing a local explicit 
operator q = Ku on data defined at the vertices of an undirected graph G = (V, E) 
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which we call grid. Locality of the operator K means that computation of q(x), x € 
V , involves values of u(y), y E V , where y is at a (graph) distance at most r from 
x. This r is called the order of K and assumed to be independent of G. K is 
explicit, meaning that q and u are distinct arrays and, hence, the values of q can 
be computed in arbitrary order. 

We consider structured and unstructured grids that have an explicit or implicit 
embedding into an Euclidean space. Structured grids are Cartesian products of line 
graphs, while edges of unstructured grids are defined explicitly by an adjacency 
matrix! We assume that the maximum vertex degree is independent of the total 
number of vertices. A grid is called planar if its vertices and edges can be embedded 
into a plane without edge intersections. A grid is called a triangulation of a body 
B if it can be represented as a 1-dimensional skeleton of a simplicial partition of B. 

Cache Model. We consider a single-level, virtual-address-mapped, set-associ- 
ative data cache memory, see [3]. The memory, with a total capacity of S words, 
is organized in z sets of a ( associativity ) lines each. Each line contains w words. 
Hence, the cache can be characterized by the parameter triplet (a, in, S'), and its 
size S equals a * z * w words. A cache with parameters (S/ in, in, S ) is called fully 
associative, and a cache with parameters (l,in,S) is called direct-mapped. 

The cache memory is used as a temporary fast storage of words used for pro- 
cessing. A word at virtual address A is fetched into cache location (a(.4) , z ( .4 ) , in (.4) ) , 
where w(A) = A mod in, z(.4) = (.4 /in) mod z, and o(A) is determined according 
to a replacement policy (usually a variation of least recently used). The replacement 
policy is not important within the scope of this paper since our lower bounds are 
valid for any replacement policy and upper bounds are true even for direct mapped 
cache. 

The number of cache misses incurred in computation of K depends on the 
order in which elements of u are stored in the main memory. We assume that for 
structured grids an element u(*i, . . . , id) is stored at address A = h +n 1 i 2 +n 1 n 2 i 3 + 

■ ■ ■ + m ■ ■ ■ n d -ii d , where m , ■ • • , n*_ i are the grid sizes. For unstructured grids we 
don’t assume any particular ordering of the grid points (and, hence, elements of u). 
Instead we choose an ordering that reduces the number of cache misses. 

Replacement loads. A cache miss is defined as a request for a word of data 
that is not present in the cache at the time of the request. A cache load is defined 
as an explicit request for a word of data for which no explicit request has been 
made previously (a cold load), or whose residence in the cache has expired because 
of a cache load of another word of data into the exact same location in the cache 
(a replacement load). Cache load is used as a technical term for making some 
formulas and their proofs shorter. The definitions of cold and replacement loads 
are analogous to those of cold and replacement cache misses [7], respectively, and if 
w equals 1 they completely coincide. 

Surface-to-volume ratio. One technique for minimization of the number of re- 
placement loads is to cover the grid G = ( V , E) with conflict-free sets V = U V; , i = 
\,...,k, | Vi | = S, that is, sets without vertices mapped to the same location in 

'The graph distance is the length of a shortest path connecting two vertices. The length of the 
path is the sum of length of edges in the path. 
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cache. If we calculate q in all vertices of V* before calculating it in vertices of 
Vj, j > i , a replacement load can occur only at vertices having neighbors in at 
least two sets (boundary vertices). We consider only bounded degree graphs, so 
if we can find a covering with sets having volumes |V;| close to 5 and a minimal 
number of boundary vertices \dV { \ (and boundary edges) i.e., bodies with minimum 
surface-to-volume ratio, then the computation of K will have a number of replace- 
ment loads close to the minimum. On the other hand, the total partition boundary 
i \dVi\ can be used to obtain a lower bound for the number of replacement 
loads, see section 3, cf. [2, 9]. 

3 A lower bound for cache misses for local operators 

In this section we consider the following problem: for a given grid and a local 
operator K , how many cache misses must be incurred in order to compute q = Ku, 
where q and u are two arrays defined on the grid. We provide a lower bound for the 
number of cache misses in any algorithm, regardless of the order in which the grid 
points are visited for the computation of q . The lower bound contains the minimum 
surface-to-volume ratio of sets covering the grid. The ratio can be calculated for a 
number of grids: structured grids, planar unstructured grids, FFT-grids, expanders, 
and matrix multiplication grids described below. It may be shown that our lower 
bound is tight for all above-mentioned grids and in general for any grid which may 
be covered by sets having an optimal surface-to-volume ratio. 

We use the following terminology to describe the operator K . Locality of K 
means that the value of q at the grid point x is a function of the values u(y), where 
y is a grid point at the distance at most r from x (r is called radius of A). In 
this section we obtain a lower bound for an explicit operator of radius 1, which, 
obviously is a lower bound for operators of larger radii as well. 

3.1 Pointwise Computations 

Depending on the separability of the kernel of the local operator, it has to be 
computed in pointwise or edgewise fashion. If an operator has an unseparable 
kernel, then it requires values of u at all neighbor points simultaniously to compute 
a value of q . We call such computation pointwise. In this subsection we assume 
that computation of q on the grid G = (V, E) is performed in a pointwise fashion, 
that is, at any grid point the value of q is computed before computation of the 
value of q at another point is started. Operators with separable kernels which can 
be computed edgewise are considered in the next section. 

In order to compute the value of q at a grid point x, the values of u at the 
neighbor points of x must be loaded into the cache (points y and x are neighbours if 
they are connected by a grid edge). If x is a neighbor of y and u(y) has been loaded 
in cache to compute q( z) but is dropped from the cache before <?(x) is computed, 
then u( y) must be reloaded, resulting in a replacement miss. 

For a given algorithm to estimate the number of elements, p, of array u that 
must be replaced, we partition V into a disjoint union of k sets V’, with V = U i=1 V t , 
in such a way that q is computed at all points of V 2 before it is computed at any 
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point of VJ+i, see Figure 1. Let B[ and be (possibly intersecting) subsets of V 
which have neighbors of U j<iVj and U j>iVj respectively. The set Bi = B i U B t is 
the boundary of V z . 



Figure 1. The boundaries B\ and B\ of already computed values of q in a 
sequence of regions Vi. Reloading of some values of u on the boundary ofV 3 (heavy 
and dotted lines) results in at least \B 3 \ 4- \B 3 | — 25 cache misses. 


For computation of q in Vi the values of u at points of B\ have to be present in 
cache. The values of u in B\ already are in cache since these values are necessary for 
computing q at the neigbor points in U j<iVj and the computations in these points 
have been accomplished before computations in Vi have started. Since the cache 
can accomodate at most 5 of these values at least 


values have to be reloaded. 

Symmetrically, for computation of q in Vi the values of u at points of B f have 
to be present in cache. The values of u in B? later will be required for computation 
of q in Vj, j > i. These computations started after all computations in V z have 
been finished. Since the cache can accomodate at most 5 of these values, then at 
least 


values have to be reloaded. 

Hence, in each set of the partition at least 


Oi = p\ +p? >\B\\ + |£“| - 25 > \Bi\ - 25 
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values have to be reloaded. 

The total number of reloaded values in the course of computing q on the entire 
grid will be at least 

k k 

i" 2fcS - 

i— 1 2—1 

Let v be the maximum number of points in a set with 3 S points on its bound- 
ary. Then choosing partition V = ujL^VJ in such a way that |V*| = v we get 


Thus we have the following result. 

Theorem 1. The number of cache misses in calculation of an explicit operator on 
a grid G = (V, E) is 

P> — (|V'| + p) > -|V|(1 + ^a(G)) 

w w o 

where a(G) is the minimum of the surf ace- to -volume ratio over subsets of V with 
35 points on the boundary. 

Proof. We sum the number of replacements in (5) with the number of cold loads 
|V r | and notice that ^ = 1/3 g(G). Then we notice that each cache miss results in 
a load of w words in the cache. D 


3.2 Edgewise Computations 

The pointwise calculation model used in the previous subsection is too restrictive 
in many cases. Using separability of the kernel the number of cache misses can be 
reduced. For example, from Theorem 1 it follows a bound of p > c— for a matrix 
multiplication algorithm (since it is easy to see that a(MM) = 0(%-))- However, it 
is well known [9] that the number of cache misses for this problem is p = 8 m 
the general case, and the upper bound is achieved by a block matrix multiplication 
algorithm with block size y/S. In this section we present lower bounds for more 
general computations where values of q may be updated multiple times in arbitrary 
order. We call these computations edgewise. 

An edgewise computation is performed at the vertices of a bipartite graph 
H = (V',V°,E) where, V 1 and V° are two copies of V, and ( x , y), x € V 1 , y € V° 
is an edge in H iff x and y are neighbors in G or x = y. The values of u are given 
in the points of V 1 and values of q have to be computed at the points of V°. An 
edgwise computation is computation of a function of two variables corresponding to 
an edge of H. If a value at an end of the edge is not in cache then the computation 
suffers a cache miss. All edges of the grid should be computed. We want to estimate 
the number of cache misses that each computation of q = Ku must suffer. 
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The arguments for the obtaining the lower bound of Theorem 1 can be modi- 
fied by partitioning of the edges of H into disjoint sets E — U^ =1 E l in such a w'ay 
that computation of any edge in E t precedes computation of any edge in Ei + i and 
boundary of Ei is at least 35. Here the boundary of an edge set E { is the set of 
vertices incident to an edge in Ei and to an edge which is not in E { . The surface- 
to- volume ratio of an edge set is the ratio of the number of boundary vertices to the 
number of edges. We define / 3(H) to be the minimum surface-to-volume ratio of 
the edge sets in H having surface 35. The following result can be proved by exactly 
same arguments as used for Theorem 1. 

Theorem 2. The number of cache misses in a separable calculation of an explicit 
operator on a grid G = (V, E) is 

(*>- |£|(l+h?(G7)) 

w 3 

where /3(G) be the minimum of the surface-to-volume ratio over subsets of E with 
35 points on the boundary. 


4 Structured Grids 

4.1 Interference Lattice 

Interference lattice. Let u be a d-dimensional array defined at the vertices of a 
structured d-dimensional grid of size n\ • • * nj. Let L be a set in the index space of 
u having the same image in cache as the index (0, . . . , 0). L is a lattice in the sense 
that there is a generating set of vectors {b t }, i — 1 , . . . , d, such that L is the set of 
grid points {^t=i | Xi G Z }. We call L the interference lattice of u. It can be 

defined as the set of all vectors (z'i, . . . , Zd) that satisfy the Cache Miss Equation [7]: 


(zi -f nii ‘2 T n \ 77-2 i 3 + • • ■ + n\ ■ • • n^-iid) mod 5 — 0 . 

We will use some geometrical properties of lattices. Let B be a convex body 
of volume V\ symmetrical about the origin. The minimal A i such that A iB contains 
i linear independent vectors of L is called the i th successive minimum of a lattice L 
relative to B. A theorem by Minkowski, see [1] (Ch. VIII, Th. V), asserts that 


< uL h < t. 

d\V ~ det L ~ V 


(9) 


Note that the ratios of lattice successive minima relative to the unit cube and to 
the unit ball can be bounded: 1/d < < d. In the section 4.2 we use 

successive minima relative to the unit ball and in the section 4.3 we use successive 
minima relative to the unit cube. In any case we call / = A^/Ai the eccentricity of 
the lattice (not to be confused with eccentricity of a reduced basis, defined in [2], 
Section 4). The eccentricity relative to a ball and cube may differ by a factor of d 2 
at most. 
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In [2] we have introduced a tiling by parallelepipeds built on a reduced-basis 
of the interference lattice, which decreases the number of the cache misses to a 
level close to the theoretical lower bound that we also derived. Measurement shows 
that this tiling has significantly fewer cache loads than a compiler-optimized code. 
However, it has a high computational cost, since it depends on a significant number 
of integer parameters (29 integers for a 3D grid), and its implementation scans 
through a significant number of the grid points to select those suitable for cache 
conflict-free computations. This prompts us to consider tilings with Voronoi cells 
and with successive minima parallelepipeds. We show that these tilings have good 
surface- to- volume ratio if the lattice has a small eccentricity. 

4.2 Voronoi Tiling 

A Voronoi tiling is a tiling of the grid by completed cells C (Voronoi tile) of the 
Voronoi diagram. For each lattice point x a Voronoi cell is the set of points which are 
closer to x than to any other lattice point. All integer points inside each Voronoi 
cell are mapped into the cache without conflicts. Voronoi cells may not form a 
tiling since some integer points can be located on a cell boundary. There are many 
qualitatively equivalent ways to complete the cells to form a tiling. One way is to 
choose a basis in the space of the lattice and assign an integer point to the cell 
whose center is lexicographically closest to the point. 

In order to estimate the surface-to-volume ratio of a Voronoi cell C we note 
that the completed Voronoi cells form a tiling of space. Hence, the volume of C 
equals the determinant of the lattice, which is equal to 5, see [2]. On the other 
hand let C Q be a Voronoi cell centered at o. Each vertex v of C 0 is equidistant from 
d lattice points. Let r be that distance. According to the definition of the Voronoi 
cell, the ball of radius r, centered at u, contains no other lattice points. Hence 
r < R, where R is a radius of a maximal ball of the lattice (a lattice points free ball 
of the maximal radius). Hence, C a is contained in a ball of radius R , centered at 
o. Thus, the surface area of C 0 is bounded by the surface of a ball of the radius R , 
which equals dV d dR d ~ l where V d = r (i+d/ : 2 ) is the voIume of the unit d-dimensional 
ball (see [1] Ch. IX.7). 

We estimate the radius of the maximal ball R by induction on the dimension 
of a sublattice. Let R{ the radius of the maximal ball inscribed into the lattice L t 
built on the first i minima vectors of L. Then according Figure 2 we have 

R 2 < (hi/2) 2 + Rl i < (A t /2) 2 + Rl i 
and induction on i gives us the following assertion. 

Lemma 3. For the radius of a lattice points free ball in a d-dimensional lattice L 
we have the following relation: 

fl 2 <l/4^A ?. (11) 

1=1 

where Ai < * * • < A^ are successive minima of the lattice L. 
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Figure 2. The radius oj maximal ball inscribed into L can be estimated 
through the radius R { - 1 of the maximal ball inscribed into the lattice L»_i built on 
the first i - 1 minima vectors, and through the value of the last minimum A* = |Vi|. 
i/ere h t is the distance between Ld-i nnd Vi 4- Li - j 

Hence, for the surface area .4 of (7 we have the estimation 

A(C) = d,V d R d ~ l < d(Vd/2) d ~'V d \ d d d ~' < d V v V d f«-» a /*S«- l » d , 

where we used the estimation A^ < f d ~ 1 S derived from (9), and the bound 
R < ±A\ i{ which follows from (11)- This implies the following result. 

Theorem 4. The surface-to-volume ratio of a Voronoi cell C can be estimated as: 

- 4 ( g ) < rj f(d-lf/dQ-l/d 

V(C) - dS 

where c d is a constant depending on d only. 


4.3 Successive Minima Tiling 

The Voronoi cell tiling has cache-conflict-free tiles of maximum possible volume S, 
and of small surface-to-volume ratio. However, the tiles may have many faces, and 
it may be computationally expensive to scan through the grid points inside a tile. 
In this sense it is desirable to use rectilinear tiles. A successive minima tiling is a 
tiling by a Cartesian block built with use of successive minima lattice vectors of the 
unit cube. Such a block Q can be described by the system 

|ii| <b it t = ( 14 ) 


where \\ < bi < \d- . . , ^ 

The block Q can be constructed by the following ” inflating’ process. Take an 

initial cube of the form (14) with b t = 1, i = C ■ ■ ■ ,d, and increment until the 
face x l = b , contains a lattice point. Continue to increment values of all bj for which 
the face Xj = b, has no lattice points. At the end we obtain a block of the form 
(14) containing a lattice point on each of its faces and containing no lattice points 
inside except o. In the best case each successive minimum vector will belong to one 
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of the faces of the block, meaning that b % = X z (after an appropriate reordering of 
the coordinates). On the other side, it is not difficult to construct a 3-dimensional 
lattice such that the block = A x , b 2 = b 3 = X 2 < A 3t so the volume of the block 
would be strictly less then X\ - ■ ■ X d - 

Any translation of the block Q 1 = \Q obviously contains at most one lattice 
point and can be used for conflict free tiling. This block has a low surface- to- volume 
ratio if the lattice has bounded eccentricity, which can be seen from the following 
inequalities: 2 A(Q') < dX d ~ l and V(Q') > Af. Hence, the surface-to-volume of the 
block can be estimated as follows: 

< 2df d ~ 1 /Xi < d(d\V d ) 1/d } d ~ l S~ 1/d 

since A^ = / Ai and Ai > 2(^r-) 1/,d as follows from (9). 

As the representative example, the number of cache misses for tilings of 3- 
dimensional grids of sizes 40 < nx < 99, ny = 97, nz = 99 with successive minima 
parallelepipeds is shown in Figure 3. Experiments were performed on an SGI Origin 
2000 machine with a MIPS R 1 0000 processor. 



Figure 3. Comparison of cache misses for the second order stencil operator 
as a function of the first dimension (40 < nx < 99, ny = 97, nz — 99). The top 
graph shows the number of cache misses for the compiler optimized nest. The bottom 
graph is obtained for tilings with successive minima parallelepipeds . 


5 Unstructured Grids 

5.1 Lipton-Tarjan Covering 

In this section we present a covering of a planar bounded degree grid with sets of size 
at most 5 with average perimeter-to- volume ratio equal 0(1/ VS)- The tiling can be 
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used for computing an explicit first order operator on an n-vertex planar grid with 
0{n/VS) replacement loads. The tiling is based on the Lipton-Tarjan separator 
theorem asserting that any planar graph on n vertices has a vertex separator of the 
size 0(y/n). The separator can be constructed in O(n) time [8]. 

We consider bounded degree unstructured grids, that is, grids having fixed 
maximum vertex degree d, independent of the grid size. (Calculation on unbounded 
vertex degree grids causes approximation problems and numerical instability. As 
a result, only bounded degree grids are used in numerical methods.) For bounded 
degree grids a node cut of size 0(n) has a corresponding edge cut of size O(n), 
and vise versa. Hence, for bounded degree grids the Lipton-Tarjan separator the- 
orem (see [8], section 2, Corollary 2) can be reformulated as follows: By removing 
0{s/n) edges of an n-vertex planar graph it can be separated into connected disjoint 
subgraphs, each having at most 2n/3 vertices. 

We construct the covering by applying the Lipton-Tarjan theorem recursively. 
First, we choose any C{n) < c 0y /n cut of the original graph G = (V,E), where c 0 
is independent of n. According to Lipton-Tarjan theorem the cut can be chosen in 
such a way that it will split the graph into connected components G* = (V*, £*), i — 
1 ,■■■,*, IKI < 2n/3. Adding an extra step in this partition we can assume that 
\Vi\ < n/2 while C(n) < c iv /n for a bigger constant C\. Then we recursively bisect 
each connected component Gj = (Vi,Ei) while |V,| > S. We will call this covering 
Lipton-Tarjan covering . 

This partition process can be represented by a cut-tree T where nodes are 
partitioned connected components of the grid. A set is connected by edges with the 
nodes representing connected components obtained by removing the edges of the 
cut applied to the set. However, we do not include in T the connected components 
smaller than 5 in size which were not partitioned. To each node t of T we assign 
size s(t) equal to the number of vertices in the set represented by t and weight 
w(t) = From the definition of the cut-tree it follows that size of each leaf 

(i.e. node having no children) exceeds S. 

Lemma 5. The total number of edges in all cuts is 0{n/y/S). 

Proof . In the Lipton-Tarjan covering then the total number of the edges in all cuts 
can be bounded by 0{o{T)) : where 

a ( T ) = H ( 16 ) 

t node of T 

We use two properties of the weights: 

w(l)<n/y/S (17) 

/ leaf of T 

since the maximum of the V s (0 conditioned s (l) = n > s (0 >5+1 is attained 

at s(l) = S + 1 for alH. And the property 

u;(f) < 1/V2 w(t ) 

r is son of t 


( 18 ) 
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which follows from Proposition 7 below. 

Now cr(T) can be estimated in two steps. First, we replace weights in each 
nonleaf t by the right hand side of (18) going bottom up from the leaves to the root 
of T. This operation will not decrease the total weight. Second, carry summation 
of new weights across nodes of T by noticing that each leaf l deposits into the total 
sum at most 

+ -J= + yp + •••) = W {1)(2 + V2). 

Hence from (17) it follows that 

a(T) < (2 + -s/2 )n/VS ( 20 ) 

and total number of edges in all cuts is 0(n/y/S). □ 

Now we prove the main result of this section. From a lower bound on the 
number of replacement loads for computations of explicit operators on structured 
2-dimensional grids, see Section 3, cf. [2], it follows that the order of this bound 
can not be improved. 

Theorem 6. Any explicit first-order operator K on a bounded degree planar grid 
Q — (V, E) can be computed with 0{\V\j \fS) replacement loads. 

Proof. First, we reorder vertices of the grid in such a way that vertices of each set 
of the Lipton-Tarjan covering will occupy contiguous memory locations. Then we 
order the sets arbitrarily and compute K on each set separately. In this computation 
replacement loads can happen only at the vertices of the cuts of the grid. According 
to Lemma 5 the number of such vertices does not exceed 0(|I I / * S' ) - ^ 


Proposition 7. For any positive tq > v? > • • • > Vk such that t'i < 2 l l ^ e 

following inequality holds: 

k k 

£vGi>A 5> 

t=i N *=i 


Proof. The proof uses Jensen inequality, see [5], Ch 2.10, Th. 19: for 0 < r < s 


and in particular 


(J2 v h r ^ 

i=l i=l 

i - 1 *=1 1=1 
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Let Xi = i = 1, ■ • • , k, hence we have to prove 

(I>i) 2 > 2 E x * 

i=l i— 1 

where x x < (£* =2 X 1A /2 - T!l =2 x i- 

Let j is the minimal index such that x } > x l . Obviously. 1 < j < k 

then we have: 

k k 

Xi) 2 > ^2 x i + 2x l + ' ' • + 2x j-i + 2x A x j + i + ••• + **) 

i—1 i=l 

k ^ 

>Y^xj + 2x1 + + 2x ]-i + 2x j + 1 + • ' ’ + 2x k > 2 X] *? • 

i=i i=1 

□ 


5.2 Covering of Starry Grids 

In this section we show that cache efficiency of starry multidimensional grids is the 
same as for structured grids. 

Definition 8. We call grid starry if it can be mapped to a grid in lZ d that has the 
following two properties: 

★ the ratio of the length L of the longest edge of the grid to the length l of the 
shortest edge of the grid is limited by a constant independent of the number of grid 
points: 

L < c 0 l (26) 

** there are no grid points at a distance shorter than l. 

A vertex of a starry grid can be adjacant only to grid points contained in a 
ball of radius L centered at this point. On the other hand, a ball of radius l around 
any vertex of a starry grid is free of other grid points. Hence, a starry grid has a 
bounded degree. Another simple property of starry grids is that any subgrid of a 
starry grid is a starry grid. 

Starry grids are common in computations with particles distributed in a box 
and interacting via a short range potential. While there is no obvious way to verify 
that an abstract grid is starry , it is easy to verify that a grid in U d is starry. One 
simple way to construct a starry grid is to delete some vertices and incident them 
edges from a structured grid. Also it is easy to see that any starry grid contained 
in a cube can be completed to a starry grid triangulating the cube. 

We will show that a d-dimensional starry grid G = (V', E) can be covered by 
sets of size S having at most c|V'|^ boundary points all together, see Theorem 
11. This covering, as in the case of planar grids, is based on the Hyperplane Cut 
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Theorem , see Theorem 10, asserting that there is a hyperplane bisecting V r into 
almost equal size parts while cutting at most c\V\~^~ edges of the grid. From this 
we deduce the main result of the section. 

Theorem 9. Any explicit first-order operator K on a d- dimensional starry grid 
G = (V,E) can be computed with 0(\V\S~ * ) replacement loads. 

Proof. First, we reorder vertices of the grid in such a way that vertices of each set 
of the covering will occupy contiguous memory locations. Then we order the sets 
arbitrarily and compute K on each set separately. In this computation replacement 
loads can happen only at the vertices of the cuts of the grid. According to Theorem 
11 the number of such vertices does not exceed G{\V\S~^). □ 

From a lower bound on the number of replacement loads for computations 
of explicit operators on structured d-dimensional grids, see [2], it follows that this 
bound can not be improved. 

It is not difficult to see that any convex body has a small bisectior (for example 
a hyperplane orthogonal to a diametor of the body). It is not surprising that if a 
grid well represents the body then it has a small bisection width as well. We will 
construct sa et of 3d vectors and show that the bisecting hyperplane can be chosen to 
be a normal to one of these vectors. This actually gives an algorithm of complexity 
G(\V\ 2 ) for finding such a bisector. 

The following theorem is an analog of the Lipton-Tarjan cut theorem. 

Theorem 10. (Hyperplane Cut Theorem ) For any starry grid G = {V,E) embed- 
ded in 1Z d there is a hyperplanar cut of the size O (|V r |^) separating vertices of the 
grid onto two sets of size at least |V r |/3. 

Proof. Let us consider vectors of unit length u 2 € TZ d 7 |ui| = 1, i = 1, ■ • • 5 k and 
slabs Si bounded by hyperplanes orthogonal to Ui : {x € Tv^A* < u t x < p t } 7 

trisecting V see Figure 4, that is 

#{t/ € V\u x v < A*} > |V|/3 while #{u € V\uiv < A,} < |T|/3, (27) 

#{u € V\ UiV > pi} > \V\/Z while #{v e V\uiv > pi} < |V|/3. 

Since we have the freedom to choose u t we will choose them in such a way that all 
vertices have different projections on each Ui and the trisecting hyperplanes (27) 
exist. This choice of the slabs implies that each slab containes at least |V |/3 points 
while at most |V r |/3 grid points can be contained strictly inside a slab. We will show 
that ui, . . . ,Uk can be chosen in such a way that at least one slab is wide, that is 

hi = pi - Aj > 0(\V\ 1/d ). As explained at the end of the proof of this theorem, it 

follows from (26) that there exists a plane Hi — {v : UiV = r? z }, \i < Vi < Fi that 
intersects at most 0(|V I” 3- ) grid edges. This plane separates grid points into sets 
containing at least |V r |/3 points. 

Let V 1 be a set of grid points contained in exactly i slabs and Vf be a subset 
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s i 



Figure 4. Finding a wide bisecting slab . V 2 is the set of points inside the 
thick-lined polygons minus V 3 contained inside of the shaded polygon. 


of V x contained in S 3 . Since {V 1 } are disjoint sets and Ui<A;|V*| C V then 

'Z\v i \<\v\. 

Since each slab contains at least |V|/3 grid points then 


E \ v l\>\ v \/ 3 - 


If we sum all points in all slabs then each point in V 1 will be counted exactly i 
times, hence 

E 

l<i<k 


and 


E*ih > fivi-E^i ^ (|- rf+ 1 )i v i- ( 32 ) 

i>d i<d 


Let P m , m — (i 
different slabs, 
in P m , then 


1} . . . , i d ), i p ^ i q be parallelepipeds formed by the intersection of d 
Obviously, L) t>d V t C U m P m . Let U m be the number of grid points 


i>d 


E^ = E (f)v- > v^E^- 


i>d 


Now if we choose k = 3 d 1 then from (32) it follows that 


E^>^in (34) 
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The balls Bi of radius of l centered at the grid points do not intesect then 

Y u m < (cs^r 1 Y voi ( p ™ + (35) 

m rn 

where czl d = vol (Bi). 

Now we exercise our freedom of choosing vectors u t . We choose them to be 
normalized to length 1 rows of a 3d x d Vandermond matrix 


W =\(i- a ) j |, i = 1,.. .,3d, j = 1,.. - ,d, 0 < a < 1 

We choose a in such a way that Ui(v p — v q ) / 0 where v p ,v q are different grid points. 
These constraints give us at most 3d|V| 2 equations, and we choose a not to be a 
root any of them. Then each parallelepiped P m may be described by a system of 
inequalities 

h m < D m l W m (x — Xm) < h m 

where W m is a square Vandermond matrix consisting of m = (ii, . . . ,id) rows of 
W, h m — (hi x , . . . , h ld ) t i hi- is a halfwidth of P m in the direction of the vector u l3 , 
x m is the center point of P m and D m is the normalizing matrix. Hence P m + Bi is 
contained in the ball of radius 

Dm\Wm\\hm + J| < (3d) d \/d JJ(i - j)(h + l)< 3 2d d 2d+1 /' 2 {h + /). 

i<3 


where h — m&xi<i< 3 d{hi}- Since h > l then 

vol(P m + Bi) < c^h d . 

From this inequality together with (34) and (35) it follows 

This results in the desired lower bound for the width of a slab 

c 6 |V| 1 /d /</i. 

Now we show that there is a hyperplane parallel to the boundaries of the 
slab which intersects at most c 6 |V | ^ edges of the grid. We slice the slab on 
> CQ L\y\ l f d > C 7 1 V” 1 1 / ci of slices of the width L. Since total number of grid 
points in the slab does not exceed | | , then at least one slice contains less than 
cslV^ grid points. If a grid edge intersects bisector H of the slice then at least 
one end of the edge is inside of the slice (since the slab is L thick and the length 
of any edge does not exceed L). Hence, total number of the edges intersecting H 
does not exceed eg | V \ d < 5 , where 6 is the degree of the grid, which is bounded for 
starrv grids. Since PI is inside of the slab then it separates the grid points into parts 
containing at least |V r |/3 points each. □ 
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Now we can formulate our covering result, which implies that computations 
of explicit operators on starry grids can be performed with the same cache efficieny 
as on structured grids. 

Theorem 11. Nodes of starry grid G - (V, E) can be coverd by sets of size not 
exceeding S and with total boundary 0{\ V\/S l ^ d ). 

Proof. The proof closely follows the proof of the main result of Section 5.1. As 
mentioned in the beginning of the section, any subgrid of a starry grid is starry, 
hence we construct the covering by applying the Hyperplane Cut Theorem recur- 
sively. First, we choose any bisector cutting at most cq\V\^~ edges of the grid G , 
where cq is independent of G. According to the Hyperplane Cut Theorem the bi- 
sector can be chosen in such a way that it splits the grid into connected components 
Gt - ( Vi,Ei ), i = 1, . . .,k, \Vi\ < 2|V|/3. Adding an extra step in this partition 
we can assume that |V$| < | V | / 2 while the number of edges cut by the bisector 
does not exceed ci|V r | ^ for a bigger constant c\. Then we recursively bisect each 
connected component Gi = ( \\,Ei ) while \Vi\ > S . 

This partition process can be represented by a cut-tree T whose nodes are 
partitioned connected components of the grid. A set is connected by edges with the 
nodes representing connected components obtained by removing the edges of the 
cut applied to the set. However, we do not include in T not partitioned connected 
components smaller than S in size. To each node t of T we assign size s(t) equal to 
the number of vertices in the set represented by t and weight w(t) = . From 

the definition of the cut-tree it follows that the size of each leaf exceeds S. The 

total number of the edges in all cuts can be bounded by 0(a(T)), w^here 

a(T) = ^2 (42) 

t node of T 

We use two properties of the weights: 

^ w(l) < \V\/S— (43) 

l leaf of T 

since the maximum of s(l) * for sizes of the nodes normalized so that s (0 = 
\V\, s(l) > S + 1 is attained at s(l) = 5 + 1 for all l. And the property 

w(t) < c ^2 ( 44 ) 

t is son of t 

for some c < 1 independent on the grid, wdiich follows from Proposition 12. 

Now o{T) can be estimated in two steps. First, we replace weights in each 
nonleaf t by the right hand side of (44) going bottom up from the leaves to the root of 
T. This operation will not decrease the total weight. Second, carry the summation 
of the new weights across nodes of T by noticing that each leaf l deposits into the 
total sum at most 


w(l)( 1 + c 4- c 2 + . . . ) = w(l)cs. 
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Hence from (43) it follows that 

a(T) <c h \V\/S^. (46) 

meaning that the total number of edges in all cuts is 0( \V\/ 5~). □ 


Proposition 12. For any positive > • • • > Vk >0 such that V\ < 2 v *> ^he 

following inequality holds: 




1 Id 


Proof. The proof uses Jensen inequality, see [5], Ch 2.10, Th. 19: for 0 < r < s 

i— 1 i=l 


and in particular 

1= 1 i— 1 i=l 

Let Xi = v l / d , i — 1, . . . , k, hence we have to prove 

k k 

(£>) d >(<*+l)/2l>? 

i=l 1=1 


where ij < (£*_ 2 xf) 1/d < £* =2 x * and x i > ' ' ' > x k > 0- 

Let j be the minimal index such that x j ^ St=j+i Xi. Obviously, 1 < j < k 
then we have: 


(Y. x *) d >ll s *i +dx i 1 + ■ • • + dxj_i + dx* 1 53 x i 


=1 


i=i 


*=7 


i=j + 1 


k 


> x * + ^1 + ' ' 

• • + dx'‘ 

2 = 1 


k 


> X i + dX ' + * ' 

2=1 

■ ■ + dx d . 


+ dx d - X (Xj + i + I-Xjfc) 

k 

+ dx d j+1 +...+dxi>(d+ l)/2 *?. 

1=1 


□ 
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5.3 Cache Unfriendly 3-dimensional Grid 

In this section we construct a 3-dimensional grid of N vertices which has a subgrid 
of G the size cN that does not have small subsets with small surface-to-volume 
ratio. From this property, following the arguments of 3, it can be shown that for 
any computation of an explicit operator defined on the grid /logN) replacement 
loads must occur. 

Our construction is based on embedding an FFT butterfly graph into a tri- 
angulation of a 3-dimensional cube. The 2 n -point FFT graph, denoted as F n , is a 
graph having (n + l)2 n = N vertices arranged in n + 1 layers of 2 n vertices each. 
In other words, vertices of F n form an array (A,i)> 0 < k < n, 0 < i < 2 n — 1 and 
a vertex (A, i), k < n is connected with vertices (A + 1, i) and (k + 1, i © 2 k ) where 
i 0 2 k signifies taking the complement of k th bit of i, see Figure 5. 



Figure 5. A recursive construction of the FFT graph. F n is built from two 
copies of F n ~ i by adding (n + \) th layer of 2 n vertices and connecting them with 
vertices of n th layer by the butterflies. 


Theorem 13. 

on a grid F n is 


The number of cache misses in calculation of an explicit operator 


a > -N( 1 + 
w 



where c is a constant. 

Proof. The theorem is a direct consequence of Theorem 1 and of an estimation of 
the boundary of vertex coverings of F n given in Corollary 15. □ 


Lemma 14. For any node subset V C F n we have 

\V\<2\8V\\og\SV\ (53) 

where SV is the right boundary of V, that is, the set of points in V either on the 
right boundary of F n or having a right neighbor not in V. 
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Proof. Our proof of inequality (53) is based on induction and is similar to the 
proof of Theorem 4.1 in [9]. Let V be partitioned into three sets A, B and C, as 
shown in Figure 6. From the figure we can see that 



Figure 6. Induction step for proving the surface-to-volume inequality of a 
subset in F n . We can assume that |.4q | > | Bo | . 


\SV\ > \6A\ + \SB\ + D + min(0, \C\ - 2|.4 0 |) 

| V| < \A\ + \B\ + 2A 0 + min(0, \C\ - 2|A 0 |) 
where D — |j4o| — |Bo|- If \C\ < 2|j4o| then, by induction, 

|V| < 2(|<L4|log|<L4| + |(5B| log \6B\ + |A 0 |) < 2(|<5V'| log \SV\ - X) 

where 

X rn \SA\ log(l + + j4b) 

+ \6B\ log(l + 7^4? + + D log(|<L4| + |<fB| + D)-D-B 0 . 

Since |B 0 | < \6B\ and |B„| < |A 0 | < \SA\ and either log(l + ^ + jm\) - 1 OT 
log(l 4- j^j + y^y) > 1, or both, then X > 0. Hence |V r | < 2|<5F| log |<5V |. 

If y — min(0, \C\ - 2\A 0 \) > 0 then (53) follows from the fact that v + y < 
2 (d + y) log(cf + t/) if v < 2d log d. □ 


Corollary 15. Let F n = |J V), i = 1, . . . , k, \Vi\ < S is any partition and S < 2 n/8 . 
Then for the sum of boundaries of the sets of the partition the following inequality 
is true: 


£re)l> 

i=l 


N 

4 log S 


( 58 ) 
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Proof . For any subset V" C F n it follows from (58) that 
we can estimate the sum of boundaries of the partition. 


Ew)i>Ei^)i- 2 - 2n ^E 


i= 1 


a=l 


m 

2 ^ log | Vi | 

2!°g5 ^ 


|JV| > ||V|/ log |V|, and 


2 • 2 n 


- 2 • 2 n > 


N 

4 log 5 


Where the last inequalty holds since S < 2 n / 8 . □ 

The FFT graph can be embedded into a triangulation of a 3-dimensional cube. 
A recursive construction of the FFT graph into triangulation of simplices is shown 
in Figure 7. The simplices can be embedded into a cube as shown in Figure 9 
which then can be partitioned into parallelepipeds with further triangulation of 
each parallelepiped. 

The butterflies connecting tw*o last layers of F n can be embedded into a tri- 
angulation of a simplex in such a way that the edges of the butterflies are mapped 
onto lines of pieces ( to , 1 3, 67, 64) and (£7, £4, 63, &o) of one of the ruled surfaces 2 and 
two skewed ruled surfaces (to ? *3> 63 > bo) and (£7, *4,64, £>7)- A simplex built on the 
appropriate vertices is separated by a ruled surface into two parts as showm in Figure 
10, the top view is shown in Figure 8. The whole simplex (to? ^7> 67, bo) can be par- 
titioned into the four simplices listed above and 5 primitive simplices: (t 3 , £4, 63, 64), 
(*3, *4,64,67), (*3, *4, 63, 60), (*0**3,64,63) and (* 7 ,*4,6 4 ,6 3 ). Each of the sim pbces 
(*0**3,67,64), (*7, *4,6 3 ,6 o) 5 (*0**3,63,60) and (t 7 ,* 4 A, 67) is separated by a ruled 
surface, hence it is sufficient to build a triangulation of a simplex separated by a 
ruled surface see Figure 10. This can be done in 3 steps: 1. adding vertices on 
the edges which are not parts of the ruled surface, 2. partitioning the simplex 
onto triangular prisms, and 3. triangulating each triangular prism into 3 primitive 
simplices. 

It is easy to verify that the total number of vertices in the triangulation does 
not exceed M = 3n2 n and the degree of each node does not exceed 16. Hence we 
have constructed a triangulation having the property declared at the beginning of 
this section. 


6 Related Work and Conclusions 

The reduction of cache misses in scientific computations is an active subject of 
research. One of the first lower bounds for data movement between primary and 
secondary storage was obtained in [9] . Recently the work has focused on developing 
compiler techniques to reduce the number of cache misses. In this direction we 
mention [7], where the notion of the cache miss equation (CME) and a tiling of 

2 A ruled surface is built by linearly parametrizing two crossing lines in 3D space and connecting 
corresponding points by lines. A ruled surface can be viewed as a hyperboloid containing the two 
crossing lines. 
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Figure 7. Recursive con- 
struction of embedding of FFT graph 
into a triangulations of a simplex. 



Figure 9. Recursive triangu- 
lation of a cube. 


h 



Figure 8. Embedding one 
layer of FFT graph into a triangulations 
of a simplex, top view. 



Figure 10. Triangulation of 
a simplex separated by a ruled surface 
via adding points si,s 2 ,s 3 . Only parti- 
tion not shadowed by the ruled surface 
is shown. 


structured grids with conflict free rectilinear parallelepipeds were introduced. Some 
tight lower and upper bounds for computation of explicit operators on structured 
grids were obtained in [2], where a tiling with a reduced fundamental parallelepiped 
of the interference lattice was used for reduction of cache misses. Some practical 
methods for improving cache performance in computations of explicit operators are 

given in [10]. . 

We showed that the reduction of cache misses for computations ot explicit 
local operators defined on discretization grids is closely related to the problem of 
covering the grids with conflict free sets having low surface-to-volume ratio. We 
introduced two new coverings of structured grids: a covering with Voronoi cells and 
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a covering with rectilinear parallelepipeds built on the vectors of successive minima 
of the interference lattice. The cells of both coverings have near-minimal surface-to- 
volume ratios. Direct measurements of the cache misses show a significant advantage 
of the successive minima covering relative to computations using the natural loop 
order, maximally optimized by a compiler. We also showed that the computations of 
explicit operators on planar unstructured grids can be organized in such a way that 
the number of replacement loads is asymptotically close to one of the structured 
grids. 
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